Learning properties of Support Vector Machines 
In this article, we study the typical learning properties of the recently proposed Support Vector
Machines. The generalization error on linearly separable tasks, the capacity, the typical number
of Support Vectors, the margin, and the robustness or noise tolerance of a class of Support Vector
Machines are determined in the framework of Statistical Mechanics. The robustness is shown to be
closely related to the generalization properties of these machines.



Support Vector Machines, recently proposed to solve the problem of learning classification tasks
from examples, have aroused a great deal of interest due to the simplicity of their implementation,
and to their remarkable performance on difficult tasks [?]. Classification of data is a very
general problem, as many real-life applications, like pattern recognition, medical diagnosis,
etc., may be cast as classification tasks. In the last few years, much work has been done to
understand how high-performance learning may be achieved, mainly within the paradigm of neural
networks. These are systems composed of interconnected neurons, which are two-state units like
spins. The neuron's state is determined, like in magnets, by the sign of the weighted sum of its
inputs, which acts as an external field, and of the states of its neighbors. Learning with neural
networks means determining their connectivity and the weights of the connections. The aim is to
classify correctly not only the examples, or training patterns, but also new data, as we expect
that the learning system will be able to generalize. A single neuron connected to its inputs, the
simple perceptron (SP), is the elementary neural network. It separates the input patterns into two
classes by a hyperplane orthogonal to a vector whose components are the connection weights. Thus,
the SP can learn without errors only Linearly Separable (LS) tasks. Most classification problems
are not LS, requiring learning machines with more degrees of freedom. However, the relationship
between the machine's complexity, its learning capacity and its generalization ability is still an
open problem. Feedforward layered networks, the multilayered perceptrons, are the most popular
learning machines. Their architecture is usually found through a trial and error procedure, in
which the weights are determined with Backpropagation [?], a learning algorithm that performs a
gradient descent on a cost function. Its main drawback is that it usually gets trapped in
metastable states. Growth heuristics that avoid using Backpropagation have also been proposed [?].

Support Vector Machines (SVM) are an alternative solution to the learning problem, whose typical
properties have not been studied theoretically yet. The idea underlying SVM is to map the patterns
from the input space to a new space, the feature-space, through a non-linear transformation chosen
a priori. Provided that the dimension of the feature-space is large enough, the image of the
training patterns will be LS, i.e. learnable by an SP. It is well known that, if the training set
is LS, there is an infinite number of error-free separating hyperplanes. Among them, the Maximal
Stability Perceptron (MSP) has weights that maximize the distance of the patterns closest to it.
The SVM weight vector is that of the MSP in feature-space. The patterns closest to the separating
hyperplane are called Support Vectors (SV); their distance to the hyperplane is the maximal
stability or SV-margin. The important point is that the SV determine uniquely the MSP. Their
number is proportional to the number of training patterns, and not to the dimension of the
feature-space (which may be huge). Thus, increasing the feature-space dimension does not
necessarily increase the number of parameters to be learned, a fact that makes the SVM very
attractive for applications. For example, in the problem of digit recognition [?] the input space
of dimension 256 needs to be mapped onto a space of dimension $256^7 \sim 10^{16}$, but the number
of parameters to be determined is as low as 422. However, in spite of the high performance reached
by SVMs in realistic problems [2], a theoretical understanding of their properties is still
lacking.

We consider, within the framework of Statistical Mechanics, SVMs defined by particular families of
mappings between the input-space and the feature-space. We address several important questions
about these machines. The generalization error in the particular case of learning an LS task is
shown to decrease more slowly than that of an SP (in input-space) as a function of the number of
training patterns. The capacity increases proportionally to the dimension of the feature-space.
The number of SV and the SV-margin present interesting scaling with the number of features. The
probability of misclassification of training patterns corrupted after learning is shown to be a
decreasing function of the SV-margin. This property, which we call robustness or noise-tolerance,
may account for the good generalization performance of SVMs in applications.

We assume that we are given a training set of $P$ independent $N$-dimensional vectors, the
training patterns $\xi^\mu$, $\mu = 1, \ldots, P$, and their corresponding classes
$\tau^\mu = \pm 1$. The patterns are supposed to be drawn with a probability density
$P(\xi) = (2\pi)^{-N/2} \exp(-\xi^2/2)$, and the classes $\tau$ are given by an unknown function
$\tau(\xi)$ called supervisor or teacher.

We focus on SVMs defined by a nonlinear transformation $\Phi$ that maps the $N$-dimensional input
space to a $(k+1)N$-dimensional feature-space through

$$ \xi \to \Phi(\xi) = \{\xi, \phi(\lambda_1)\xi, \cdots, \phi(\lambda_k)\xi\}, \tag{1} $$

where the $\lambda_i$ are functions of $\xi$. The components $\phi(\lambda_i)\xi$
($i = 1, \cdots, k$) are the new features that hopefully should make the task linearly separable
in feature-space.

In the following we consider odd functions $\phi$, and $\lambda_i = \xi \cdot \mathbf{B}_i$ where
$\{\mathbf{B}_i\}_{i=1 \ldots k}$ is a set of $k$ orthonormal vectors
($\mathbf{B}_i \cdot \mathbf{B}_j = \delta_{ij}$). With this choice, the new features are
uncorrelated. For example, the first $k$ generators $\{\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_k\}$
of the input space ($\mathbf{e}_1 = (1, 0, \cdots, 0)$, $\mathbf{e}_2 = (0, 1, 0, \cdots, 0)$, $\cdots$)
are one possible realization of the $\mathbf{B}_i$. In the thermodynamic limit considered below,
any set of $k$ randomly selected normalized vectors $\mathbf{B}_i$ satisfies the orthogonality
constraint with probability one. The functions $\phi(\lambda) = \mathrm{sign}(\lambda)$ and
$\phi(\lambda) = \lambda$ are of particular interest. If $k = N$, an SVM using the latter can
implement all the possible discriminating surfaces of second order in input space. More
complicated transformations, equivalent to higher order surfaces, may be considered (for examples,
see [?]).
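The mapping of Eq. (1) can be illustrated with a minimal NumPy sketch. It assumes the choice $\mathbf{B}_i = \mathbf{e}_i$ mentioned above; the function name `feature_map`, the sizes and the random seed are hypothetical and only serve the illustration.

```python
import numpy as np

def feature_map(xi, B, phi=np.sign):
    """Map an N-dimensional pattern xi to the (k+1)N-dimensional vector
    {xi, phi(lambda_1) xi, ..., phi(lambda_k) xi}, with lambda_i = xi . B_i."""
    xi = np.asarray(xi, dtype=float)
    lam = B @ xi                          # the k projections lambda_i
    blocks = [xi] + [phi(l) * xi for l in lam]
    return np.concatenate(blocks)

rng = np.random.default_rng(0)            # illustrative seed
N, k = 5, 2                               # illustrative sizes
B = np.eye(N)[:k]                         # B_i = e_i, the first k generators
xi = rng.standard_normal(N)               # pattern drawn from the Gaussian density
Phi_sign = feature_map(xi, B, np.sign)        # phi(lambda) = sign(lambda)
Phi_lin = feature_map(xi, B, lambda x: x)     # phi(lambda) = lambda
```

With $\phi(\lambda) = \lambda$ the new components are products $\xi_i \xi_j$, which is why this choice produces second-order surfaces in input space.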

The output of the SVM to a pattern $\xi$ is $\sigma = \mathrm{sign}(\mathbf{J} \cdot \Phi(\xi))$,
where $\mathbf{J} = \{\mathbf{J}_0, \mathbf{J}_1, \cdots, \mathbf{J}_k\}$ is a
$(k+1)N$-dimensional vector. Hereafter we consider normalized weights,
$\mathbf{J} \cdot \mathbf{J} = (k+1)N$, without any loss of generality, but we do not impose any
constraint on the normalization of each $N$-dimensional vector $\mathbf{J}_i$. The stability of a
training pattern $\xi^\mu$ of class $\tau^\mu$ in feature-space is

$$ \gamma^\mu = \frac{\tau^\mu \, \mathbf{J} \cdot \Phi(\xi^\mu)}{\sqrt{(k+1)N}}. \tag{2} $$

Geometrically, $|\gamma^\mu|$ is the distance of the image of pattern $\xi^\mu$ to the hyperplane
orthogonal to $\mathbf{J}$. The aim of learning is to determine a vector $\mathbf{J}$ such that
$\sigma^\mu = \tau^\mu$, or equivalently $\gamma^\mu > 0$, for all $\mu$. Any vector $\mathbf{J}$
that meets these learning conditions separates linearly, in the feature-space, the image $\Phi$ of
patterns with output $+1$ from those with output $-1$. Due to the non-linearity of $\Phi$, this
separation is not linear in input space. More general SVMs, that use a Kernel
$\mathcal{N}(\mathbf{J}, \Phi(\xi))$ instead of the inner product in Eq. (2), have been proposed
[?], but we restrict to the inner product in the following.
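As an illustration of Eq. (2), the following sketch computes the stabilities of a few patterns for a given weight vector. It divides by the norm of $\mathbf{J}$, which coincides with $\sqrt{(k+1)N}$ under the normalization adopted in the text; the toy feature-space images and the name `stabilities` are assumptions made for the example.

```python
import numpy as np

def stabilities(J, Phi, tau):
    """gamma^mu = tau^mu J . Phi(xi^mu) / |J| for each row of Phi."""
    J = np.asarray(J, dtype=float)
    return tau * (Phi @ J) / np.sqrt(J @ J)

# Toy feature-space images (one per row) and their classes; values are illustrative.
Phi = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
tau = np.array([1, 1, -1])
J = np.array([1.0, 1.0])

gamma = stabilities(J, Phi, tau)
learned = bool(np.all(gamma > 0))   # True when every pattern is correctly classified
```

The smallest entry of `gamma` is the distance of the closest pattern to the hyperplane defined by this particular `J`.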
The SV-margin is

$$ \kappa_{\max}(\mathbf{J}^*) = \max_{\mathbf{J}} \inf_{\mu} \gamma^\mu, \tag{3} $$

where $\mathbf{J}^*$, the MSP weight vector in feature-space, is a linear combination of the SV
[?], $\mathbf{J}^* = \sum_{\mu \in SV} x^\mu \tau^\mu \Phi(\xi^\mu)$. The $x^\mu$ are positive
parameters to be determined by the learning algorithm, which has to determine also the number of
SV. Generally, this number is small compared with the feature-space dimension, a fact that allows
the latter to be increased considerably without dramatically increasing the number of parameters
to be determined. The SVM in input-space ($k = 0$), or linear SVM, is the usual MSP, whose
properties have been extensively studied (see [?] and references therein).
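A weight vector of the form $\sum_\mu x^\mu \tau^\mu \Phi(\xi^\mu)$ with large margin can be approximated numerically. The sketch below uses the MinOver iteration of Krauth and Mézard, which is not the procedure discussed in the text and is chosen here only for its simplicity; the data sizes, the seed, the number of steps and the 5% tolerance used to flag approximate SV are all assumptions.

```python
import numpy as np

def minover(Phi, tau, n_steps=5000):
    """MinOver iteration: repeatedly reinforce the pattern of lowest stability.
    Returns the weight vector J and the non-negative coefficients x."""
    P, D = Phi.shape
    x = np.zeros(P)
    J = np.zeros(D)
    for _ in range(n_steps):
        fields = tau * (Phi @ J)
        mu = int(np.argmin(fields))       # least stable pattern
        x[mu] += 1.0
        J += tau[mu] * Phi[mu]
    return J, x

rng = np.random.default_rng(1)            # illustrative seed
N, P = 20, 10                             # illustrative sizes (linear SVM, k = 0)
xi = rng.standard_normal((P, N))
teacher = rng.standard_normal(N)
tau = np.sign(xi @ teacher)               # classes given by a teacher perceptron

J, x = minover(xi, tau)
gamma = tau * (xi @ J) / np.sqrt(J @ J)
margin = gamma.min()                      # approximation of the SV-margin
approx_sv = np.flatnonzero(gamma < 1.05 * margin)   # patterns close to the margin
```

The same function can be applied to feature-space images instead of raw patterns; only the rows of `Phi` change.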

We obtain the generic properties of the SVM through the by now standard replica approach [?].
Results are obtained in the thermodynamic limit, in which the input space dimension and the number
of training patterns go to infinity ($N \to +\infty$, $P \to +\infty$) keeping the reduced number
of patterns $\alpha = P/N$ constant. In this limit, the SVM properties are independent of the
training set. The appropriate cost function is
$E(\mathbf{J}, \mathcal{L}_\alpha, \kappa) = \sum_\mu \Theta(\kappa - \gamma^\mu)$, where $\Theta$
is the Heaviside function and $\mathcal{L}_\alpha$ represents the training set. It counts the
number of training patterns that have a stability smaller than $\kappa$ in feature-space. The
largest value of $\kappa$ that satisfies $E(\mathbf{J}^*, \mathcal{L}_\alpha, \kappa) = 0$ is the
SV-margin. The weight vector $\mathbf{J}^*$ defines the SVM. Its generic properties are determined
by the free energy

$$ f = - \lim_{\beta \to \infty} \lim_{N \to \infty} \frac{1}{\beta N} \langle \ln Z \rangle, \tag{4} $$

where $Z = \int dP(\mathbf{J}) \exp(-\beta E(\mathbf{J}, \mathcal{L}_\alpha, \kappa))$ is the
partition function, $dP(\mathbf{J}) = d\mathbf{J} \, \delta((k+1)N - \mathbf{J} \cdot \mathbf{J})$
and $\beta$ is an inverse temperature. In Eq. (4), the bracket stands for the average over all the
possible training sets $\mathcal{L}_\alpha$ at given $\alpha$. If the problem is LS, then $f = 0$
for $\kappa > 0$, meaning that error-free learning is possible. But in general, the probability of
error-free learning vanishes beyond some value of $\kappa$. The maximal value of $\kappa$ for
which $f = 0$ is the typical value of $\kappa_{\max}(k, \alpha)$. The free energy is calculated
using the replica trick $\langle \ln Z \rangle = \lim_{n \to 0} \ln \langle Z^n \rangle / n$.
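The identity behind the replica trick can be checked numerically on a toy example. In the sketch below the partition function is replaced by hypothetical positive random numbers (log-normal samples), which have nothing to do with the actual SVM partition function; the point is only that $\ln \langle Z^n \rangle / n$ approaches $\langle \ln Z \rangle$ as $n$ becomes small.

```python
import numpy as np

rng = np.random.default_rng(2)            # illustrative seed
# Hypothetical stand-in for the partition function: positive random samples.
Z = np.exp(rng.standard_normal(200_000))

quenched = np.mean(np.log(Z))             # direct estimate of <ln Z>

def replica_estimate(n):
    """ln <Z^n> / n, which tends to <ln Z> when n goes to zero."""
    return np.log(np.mean(Z ** n)) / n

estimates = {n: replica_estimate(n) for n in (1.0, 0.1, 0.001)}
```

For $n = 1$ the estimate is the annealed average $\ln \langle Z \rangle$, which differs visibly from the quenched one for these samples.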

We consider first the case of a teacher that is an SP in input space, of (unknown) $N$-dimensional
normalized weights $\mathbf{K}$ ($\mathbf{K} \cdot \mathbf{K} = 1$). Thus, the classes of the
training patterns $\xi^\mu$ are $\tau^\mu = \mathrm{sign}(\mathbf{K} \cdot \xi^\mu)$. In this
case, an error-free solution exists for all $\alpha$, and we are interested in the generalization
error $\epsilon_g(k, \alpha)$, which is the probability that the trained SVM misclassifies a new
pattern $\xi$. Clearly, we do not expect that an SVM with $k > 0$ will perform well on this task,
as it corresponds to a case where the a priori selected feature-space is too complex. However,
this may well be the case in real applications. We begin by considering this LS problem mainly
because other properties considered below, like the capacity and robustness, can easily be deduced
by disregarding, or setting to zero, some of the order parameters introduced here. These are,
[Eqs. (5a)-(5c), which define the order parameters $R^a$, $v_i^a$ and $c_i^{ab}$, are not
recoverable from the source.]

where $\mathbf{J}^a$ and $\mathbf{J}^b$ are the weight vectors of replicas $a$ and $b$. The
cross-overlaps $\mathbf{J}_i^a \cdot \mathbf{J}_j^b$ ($i \neq j$), and
$\mathbf{K} \cdot \mathbf{B}_i$, may be neglected for $k \ll N$, as they are of order [illegible].
The parameters $c_i^{ab}$ are a generalization of the parameter
$x = \lim_{\beta \to +\infty} \beta (1 - q)$ in [?]. In fact, Gardner and Derrida considered a
single perceptron ($k = 0$) with normalized weights $\mathbf{J}_0$
($\mathbf{J}_0 \cdot \mathbf{J}_0 = N$), so that
$(\mathbf{J}_0^a - \mathbf{J}_0^b)^2 / 2N = 1 - \mathbf{J}_0^a \cdot \mathbf{J}_0^b / N = 1 - q_{ab}$
in their notations. We assume replica symmetry, i.e. $R^a = R$, $v_i^a = v_i$, $c_i^{ab} = c_i$
for all $a$, $b$. The parameter $R$ represents trivially the overlap between the first $N$
components of vector $\mathbf{J}$ with the teacher $\mathbf{K}$. The overlap between
$\mathbf{J}_i$ and $\mathbf{K}$ may be neglected for $i \geq 1$, since for odd functions $\phi$
and uncorrelated vectors $\mathbf{B}_i$ the new features are uncorrelated. If $\phi$ were even,
this would not be the case. The parameters $v_i$ are proportional to the norm of the
$\mathbf{J}_i$. The meaning of the parameters $c_i$ is more involved. They reflect how fast the
fluctuations of $\mathbf{J}_i$ around the minimum of the cost function decrease as the temperature
vanishes ($\beta \to +\infty$). In the case of a degenerate continuum of minima, these
fluctuations decrease too slowly, and the $c_i$ diverge. This is the case for
$\kappa < \kappa_{\max}$.

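A symmetry between the $k$ vectors $\mathbf{J}_i$, $i \geq 1$, due to the invariance with respect
to permutations of the $\mathbf{B}_i$, together with the fact that the $\mathbf{B}_i$ are
uncorrelated with $\mathbf{K}$, allows us to take $v_i = v_1$ and $c_i = c_1$ for $i \geq 1$.
Introducing $\tilde v_1 = v_1/v_0$, where $v_0$ is determined by the normalization condition
$\mathbf{J} \cdot \mathbf{J}/N = k + 1 = v_0 + k \tilde v_1 v_0$, $\tilde c_1 = c_1/c_0$ and
$\tilde c_0 = c_0/(1 + k)$, the free energy is
$f(k, \alpha, \kappa) = \max_{\tilde v_1, \tilde c_1, \tilde c_0} \min_R \, g(k, \alpha, \kappa; \tilde v_1, \tilde c_1, \tilde c_0, R)$, with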



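[Eq. (6), the explicit expression of $g(k, \alpha, \kappa; \tilde v_1, \tilde c_1, \tilde c_0, R)$,
is too damaged in the source to be reconstructed reliably. Its legible parts are a first term with
numerator $\tilde c_1 (1 - R^2) + k \tilde v_1$ and denominator
$2 \tilde c_0 \tilde c_1 (1 + k \tilde v_1)$, followed by two terms that integrate over
$D\lambda_1 \cdots D\lambda_k$ and $Dy$.]

Here $Dy = dy \exp(-y^2/2)/\sqrt{2\pi}$, $H(x) = \int_x^{+\infty} Dy$, and $a$, $b$, $e$ stand for

[Eqs. (7a) and (7b), which define $a$ and $b$, are not recoverable from the source.]

$$ e = 1 - R^2 + \tilde v_1 \sum_{i=1}^{k} \phi^2(\lambda_i). \tag{7c} $$

The generalization error $\epsilon_g(k, \alpha)$ reads

$$ \epsilon_g(k, \alpha) = \frac{1}{\pi} \int D\lambda_1 \cdots \int D\lambda_k \, \arccos\left( \frac{R}{\sqrt{e + R^2}} \right), \tag{8} $$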



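where $R$ and $e$ are evaluated at the extremum of
$g(k, \alpha, \kappa; \tilde v_1, \tilde c_1, \tilde c_0, R)$. In particular, the maximal
stability $\kappa_{\max}(k, \alpha)$ is the largest value of $\kappa$ that satisfies
$\tilde c_0(\alpha, \kappa) = +\infty$, since $f$ is non zero for finite values of $\tilde c_0$.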

If $\phi(\lambda) = \mathrm{sign}(\lambda)$, the extremization of (6) with respect to
$\tilde v_1$ and $\tilde c_1$ gives $\tilde v_1 = 1 - R^2$ and $\tilde c_1 = 1$. Notice that for
$R = 1$ (which corresponds to $\alpha = \infty$), $\tilde v_1 = 0$ (thus, $v_1 = 0$) as expected:
the new features are irrelevant because the task is LS. The fact that $\tilde c_1 = 1$ means that
the fluctuations of $\mathbf{J}_0$ and $\mathbf{J}_i$, $i \geq 1$, have the same behaviour in the
limit $\beta \to \infty$ despite the fact that their norms are different ($\tilde v_1 \neq 1$).
After introduction of these values for $\tilde v_1$ and $\tilde c_1$ in (6), we obtain
$g(k, \alpha, \kappa; \tilde c_0, R) = g\left(0, \alpha/(k+1), \kappa; \tilde c_0, R/\sqrt{1 + k(1 - R^2)}\right)$,
where the right hand side term corresponds to an SP trained with a training set of reduced size
$\alpha/(k+1)$ having an overlap $R/\sqrt{1 + k(1 - R^2)}$ with the teacher. After introducing
these values of the order parameters in (7c) and (8), we obtain
$\epsilon_g(k, \alpha) = \epsilon_g(0, \alpha/(k+1))$. As expected, the generalization error of
the SVM with $k > 0$ on an LS task is larger than the one of the linear SVM. This is due to an
entropic effect, as the SVM's phase space grows with $k$ whereas the size of the space of
functions considered, limited to the LS ones, remains the same. For large $\alpha$, the
generalization error vanishes as $0.5005 \, (k+1)/\alpha$, to be compared to the linear SVM that
has $\epsilon_g \sim 0.5005/\alpha$ [?].
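The relations quoted in this paragraph for $\phi(\lambda) = \mathrm{sign}(\lambda)$ can be written as three small helper functions. This is a sketch of the stated formulas only: it does not solve the saddle-point equations, so the overlap $R$ must be supplied by the user, and the function names are hypothetical.

```python
import math

def eps_g_from_overlap(R):
    """Generalization error of a perceptron having overlap R with the teacher."""
    return math.acos(R) / math.pi

def effective_overlap(R, k):
    """Overlap of the equivalent SP for phi = sign: R / sqrt(1 + k (1 - R^2))."""
    return R / math.sqrt(1.0 + k * (1.0 - R * R))

def eps_g_large_alpha(k, alpha):
    """Large-alpha law quoted in the text: 0.5005 (k + 1) / alpha."""
    return 0.5005 * (k + 1) / alpha

# Example: the same overlap R gives a larger error once k new features are added.
R = 0.9                                   # illustrative value
err_linear = eps_g_from_overlap(effective_overlap(R, 0))
err_k3 = eps_g_from_overlap(effective_overlap(R, 3))
```

In the asymptotic regime the factor $k + 1$ means that the SVM needs $k + 1$ times more patterns than the linear SVM to reach the same error on an LS task.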

From the above scaling, the SV-margin and the number of SV follow from the maximal stability
$\kappa_{\max}(0, \alpha)$ and the distribution of stabilities $\rho(0, \alpha; \gamma)$ of the
MSP in input space [?]. We obtain
$\rho(k, \alpha; \gamma) = \sqrt{2/\pi} \, \rho_1(k, \alpha) \, \Theta(\gamma - \kappa_{\max}(k, \alpha)) + \rho_0(k, \alpha) \, \delta(\gamma - \kappa_{\max}(k, \alpha))$,
where $\rho_1(k, \alpha) = H[-\gamma / \tan(\pi \epsilon_g(k, \alpha))] \exp(-\gamma^2/2)$ and
$\rho_0(k, \alpha)$, the typical fraction of training patterns that belong to the SV, is such that
$\rho(k, \alpha; \gamma)$ integrates to one. For $\alpha \ll 1$, the SV-margin is
$\kappa_{\max}(k, \alpha) \sim \sqrt{(k+1)/\alpha}$ and
$\rho_0(k, \alpha) \sim 1 - \sqrt{2\alpha/[\pi (k+1)]} \, \exp(-(k+1)/2\alpha)$, meaning that in
that limit almost all the training patterns are SV. For $\alpha \to \infty$,
$\kappa_{\max}(k, \alpha) \sim 0.226 \sqrt{2\pi} \, (k+1)/\alpha$ and
$\rho_0(k, \alpha) \sim 0.952 \, (k+1)/\alpha$, i.e. the typical number of SV is slightly smaller
than the feature-space dimension. Solutions for other functions $\phi$ are more complicated, and
we were not able to find a closed expression of $\epsilon_g(k, \alpha)$ for all $\alpha$. The
function $\phi$ that gives the smallest generalization error at given $k$, at least for small
$\alpha$, is $\phi(\lambda) = \mathrm{sign}(\lambda)$. But the fact that the generalization error
increases with $k$ is a general property, independent of the function $\phi$.

We turn now to the more interesting problem of the capacity, defined as the typical number of
dichotomies that the SVM may implement, a quantity closely related to the VC dimension of the
learning machine [?]. We consider training sets where the patterns' classes are given by a random
teacher, that selects outputs $+1$ and $-1$ with the same probability $1/2$. In this case, the
order parameters are (5b) and (5c). The free energy is
$f(k, \alpha, \kappa) = \max_{\tilde v_1, \tilde c_1, \tilde c_0} g(k, \alpha, \kappa; \tilde v_1, \tilde c_1, \tilde c_0)$,
where $g(k, \alpha, \kappa; \tilde v_1, \tilde c_1, \tilde c_0)$ is obtained from (6) and (7) by
setting $R = 0$.
The capacity $\alpha_c(k)$, the largest reduced number of patterns that the machine can learn
without errors, corresponds to a vanishing SV-margin, i.e. $\kappa_{\max}(k, \alpha_c(k)) = 0$.
In this case, the extrema of $g(k, \alpha, 0; \tilde v_1, \tilde c_1, \tilde c_0)$ correspond to
$\tilde c_0(\alpha, \kappa) = +\infty$ and $\tilde v_1 = \tilde c_1$ for all the possible
functions $\phi$. This result means that the capacity is $\alpha_c = 2(k+1)$, independently of
$\phi$, provided that the new features are uncorrelated. This result generalizes to other
feature-spaces the value deduced by Cover [12] through a geometrical approach that Mitchison and
Durbin [?] generalized to the case of quadratic separating surfaces. Notice that the latter
corresponds to an SVM with $\phi(\lambda) = \lambda$ and $k = N$. The capacity of SVMs is smaller
than the one of multilayered perceptrons with one hidden layer of $k + 1$ neurons, which have the
same number of degrees of freedom. For example, the capacity of the parity machine scales like
$k \ln k$, and that of the committee machine like $k \sqrt{\ln k}$, for large $k$ [?].
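The value $\alpha_c = 2(k+1)$ can be related to Cover's geometrical counting with a short sketch. It evaluates Cover's counting function for $P$ points in general position in $D$ dimensions and checks that exactly half of the dichotomies are linearly separable when $P = 2D$. Applying it with $D = (k+1)N$ is only illustrative, since general position of the feature-space images is an assumption here, not something the counting argument guarantees; the function names are hypothetical.

```python
from math import comb

def cover_count(P, D):
    """Number of dichotomies of P points in general position in D dimensions
    that a hyperplane through the origin can realize (Cover's counting function)."""
    return 2 * sum(comb(P - 1, i) for i in range(D))

def separable_fraction(P, D):
    """Fraction of the 2^P dichotomies that are linearly separable."""
    return cover_count(P, D) / 2 ** P

N, k = 4, 2                               # illustrative sizes
D = (k + 1) * N                           # feature-space dimension
fraction_at_capacity = separable_fraction(2 * D, D)   # P = 2 (k + 1) N
```

Below $P = D$ every dichotomy is separable, and the fraction drops through one half at $P = 2D$, which corresponds to $\alpha = 2(k+1)$ in units of $N$.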

It turns out that in the case $\phi(\lambda) = \mathrm{sign}(\lambda)$, the maximal stability
$\kappa_{\max}(k, \alpha)$ scales trivially with $k$. The order parameters are
$\tilde v_1 = \tilde c_1 = 1$ so that
$g(k, \alpha, \kappa; \tilde c_0) = g(0, \alpha/(k+1), \kappa; \tilde c_0)$, where the RHS
corresponds to an SP of margin $\kappa$ in input space. The maximal stability is thus
$\kappa_{\max}(k, \alpha) = \kappa_{\max}(0, \alpha/(k+1))$. From [?] we deduce that for
$\alpha \ll 1$, $\kappa_{\max}(k, \alpha) \sim \sqrt{(k+1)/\alpha}$, and for
$\alpha \to \alpha_c$, $\kappa_{\max}(k, \alpha) \sim \sqrt{\pi/8} \, (2(k+1)/\alpha - 1)$. If
$\phi(\lambda) = \lambda$, the property $\kappa_{\max}(k, \alpha) = \kappa_{\max}(0, \alpha/k)$ is
correct for $\alpha \ll k$. As $\kappa_{\max}(0, \alpha)$ is a concave decreasing function of
$\alpha$ [?], including new features may result in a large increase of the SV-margin.
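The scaling $\kappa_{\max}(k, \alpha) = \kappa_{\max}(0, \alpha/(k+1))$ can be evaluated numerically once $\kappa_{\max}(0, \alpha)$ is known. The sketch below assumes that the latter is given by Gardner's classical condition for a single perceptron with random classes, $\alpha \int_{-\kappa}^{\infty} Dt \, (t + \kappa)^2 = 1$, which is not written in the text, and solves it by bisection. The function names, the bracketing interval and the tolerance are hypothetical.

```python
import math

def H(x):
    """Gaussian tail H(x) = integral from x to infinity of Dy."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def inverse_capacity(kappa):
    """Closed form of the integral of Dt (t + kappa)^2 from -kappa to infinity."""
    gauss = math.exp(-0.5 * kappa * kappa) / math.sqrt(2.0 * math.pi)
    return (1.0 + kappa * kappa) * H(-kappa) + kappa * gauss

def kappa_max_sp(alpha, hi=1e3, tol=1e-12):
    """Margin of a single perceptron at load alpha, by bisection on
    alpha * inverse_capacity(kappa) = 1 (assumed valid for 1e-6 < alpha <= 2)."""
    if alpha >= 2.0:
        return 0.0                        # beyond the capacity the margin vanishes
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if alpha * inverse_capacity(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kappa_max_sign(k, alpha):
    """Scaling stated in the text for phi = sign."""
    return kappa_max_sp(alpha / (k + 1))
```

For instance, `kappa_max_sign(3, 1.0)` is larger than `kappa_max_sp(1.0)`, which illustrates how adding features increases the margin at fixed $\alpha$.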

In most classification problems we expect that similar patterns belong to the same class. In that
case, having a large SV-margin may be beneficial for the generalization performance. In
particular, if slightly corrupted versions of the training patterns are presented to the trained
SVM, its output should not change. We consider an SVM that achieved error-free learning with an
SV-margin $\kappa_{\max} > 0$. We assume that the training patterns are corrupted, after the
learning process, through $\xi^\mu \to \xi^\mu + \eta^\mu$, where $\eta^\mu$ are randomly
distributed vectors with probability distribution:

$$ P(\eta) = (2\pi\Delta^2)^{-N/2} \exp(-\eta^2/2\Delta^2). \tag{9} $$

We are interested in the classification error of the SVM on the corrupted patterns, defined as
$\epsilon_t(k, \kappa, \Delta) = \sum_\mu (\sigma^\mu(\Delta) - \tau^\mu)^2/(4P)$, where
$\tau^\mu$ is the original pattern's class and $\sigma^\mu(\Delta)$ the SVM's output to the
corrupted pattern. The dependence on $\alpha$ is implicitly included through
$\kappa = \kappa_{\max}(k, \alpha)$. $\epsilon_t$ characterizes the robustness of the SVM with
respect to a small corruption of the patterns ($\Delta \ll 1$). Input vectors close to a training
pattern will be given the same class as that pattern with probability
$1 - \epsilon_t(k, \kappa, \Delta)$.

In the case of the linear SVM (the SP), a straightforward calculation gives the expression of
$\epsilon_t(0, \kappa, \Delta)$ written in Eq. (10).



If the margin is $\kappa = 0$, one half of the training patterns have zero stability, and
$\epsilon_t(0, 0, \Delta) > 1/4$. Thus, any small perturbation results in misclassifications. If
$\kappa > 0$, then $\epsilon_t(0, \kappa, \Delta) \sim \exp(-\kappa^2/2\Delta^2)$ for small
$\Delta$. Consider next the general SVMs. If $\phi(\lambda) = \mathrm{sign}(\lambda)$ and
$\kappa > 0$, $\epsilon_t(k, \kappa, \Delta) \sim \Delta$ for small $\Delta$. In comparison with
the SP, the robustness of the SVM is poor. This is due to the discontinuity of the function
$\phi$, as a small perturbation of the input pattern may produce a strong perturbation on its
stability. On the contrary, for continuous functions $\phi$, like $\phi(\lambda) = \lambda$, and
small $\Delta$, $\epsilon_t(k, \kappa, \Delta) \sim \exp(-h(k)\kappa/\Delta)$, where $h(k)$ is an
increasing function of $k$. Thus, continuous functions $\phi$ are preferable for improving the
SVM's robustness or noise tolerance.
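The behaviour of the linear SVM described above can be checked by evaluating Eq. (10) numerically. The sketch below assumes the form $\epsilon_t(0, \kappa, \Delta) = H(-\kappa) H(\kappa/\Delta) + \int_\kappa^{+\infty} Dz \, H(z/\Delta)$, in which the lower integration limit $\kappa$ is a reconstruction, and uses a plain midpoint rule with a hypothetical cutoff `z_max`.

```python
import math

def H(x):
    """Gaussian tail H(x) = integral from x to infinity of Dy."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def eps_t_sp(kappa, delta, z_max=10.0, n=20000):
    """Assumed form of Eq. (10) for the linear SVM: contribution of the patterns
    sitting at the margin plus that of the patterns with stability above kappa."""
    peak = H(-kappa) * H(kappa / delta)
    upper = max(z_max, kappa)
    h = (upper - kappa) / n
    tail = 0.0
    for i in range(n):                    # midpoint rule on [kappa, upper]
        z = kappa + (i + 0.5) * h
        tail += math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * H(z / delta)
    return peak + tail * h

delta = 0.1                               # illustrative noise level
zero_margin = eps_t_sp(0.0, delta)        # exceeds 1/4 whatever the noise
with_margin = eps_t_sp(1.0, delta)        # strongly suppressed by the margin
```

Comparing the two values shows how quickly a finite margin suppresses the misclassification of corrupted training patterns in the linear case.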

In conclusion, we presented the first study of the typical properties of a class of SVMs. We
determined, as a function of the number of new features and the number of training patterns, the
fraction of SV, the behaviour of the margin, the generalization error on a linearly separable
task, the capacity and the probability of misclassification of slightly corrupted training
patterns. Our results may explain why maximizing the margin is so important: the probability that
the trained SVM will assign the same class to the corrupted as to the original training patterns
is enhanced by large margins.



$$ \epsilon_t(0, \kappa, \Delta) = H(-\kappa) \, H(\kappa/\Delta) + \int_{\kappa}^{+\infty} Dz \, H(z/\Delta). \tag{10} $$